feat(search): add BM25 ranked text search #9652
shaunpatterson wants to merge 12 commits into dgraph-io:main
Conversation
Add BM25 relevance-ranked text search to Dgraph, enabling users to query text predicates and receive results ordered by relevance score instead of boolean matching.

Implementation:
- New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem) that preserves term frequencies for TF counting
- BM25-specific index storage: per-term TF posting lists, doc-length lists, and corpus statistics (doc count, total terms)
- Query execution with full BM25 scoring:
  score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf)
  IDF = log1p((N - df + 0.5) / (df + 0.5))
- DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter
- Schema syntax: @index(bm25)
- Parameter validation (k > 0, 0 <= b <= 1)
- Early UID intersection for filter-mode performance
- All-stopword document and query handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three critical bugs fixed:

1. REF postings lose Value during rollup: the posting-list encode/rollup cycle strips the Value field from REF postings without facets (list.go:1630). BM25 term frequencies and doc lengths were stored in Value and lost. Fix: store TF and doclen as facets on REF postings, which are preserved.
2. Missing function validation: query/query.go has a separate isValidFuncName check from dql/parser.go. "bm25" was only added to the parser, causing "Invalid function name: bm25" at query time.
3. Unsorted UIDs break the query pipeline: BM25 returned UIDs sorted by score, but the query pipeline (algo.MergeSorted, child predicate fetching) requires UID-ascending order. Fix: sort UIDs ascending in UidMatrix, applying first/offset pagination on score-sorted results before UID sorting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
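The ordering fix in item 3 can be illustrated with a minimal sketch (names and types are hypothetical, not the PR's actual code): pagination must happen while results are still score-ordered, and only the surviving page is re-sorted into the UID-ascending order the rest of the pipeline expects.

```go
package main

import (
	"fmt"
	"sort"
)

type scored struct {
	UID   uint64
	Score float64
}

// paginateThenSort applies first/offset to a score-descending result set,
// then sorts the surviving page UID-ascending, mirroring the fix above.
func paginateThenSort(byScore []scored, offset, first int) []uint64 {
	if offset > len(byScore) {
		offset = len(byScore)
	}
	page := byScore[offset:]
	if first > 0 && first < len(page) {
		page = page[:first]
	}
	uids := make([]uint64, len(page))
	for i, s := range page {
		uids[i] = s.UID
	}
	sort.Slice(uids, func(i, j int) bool { return uids[i] < uids[j] })
	return uids
}

func main() {
	byScore := []scored{{42, 9.1}, {7, 8.2}, {99, 3.3}}
	// Top-2 by score is {42, 7}; the page then comes back UID-ascending.
	fmt.Println(paginateThenSort(byScore, 0, 2)) // [7 42]
}
```

Sorting before pagination would paginate by UID rather than by relevance, which is exactly the bug the commit describes.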
Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact varint-encoded binary blobs stored as direct Badger KV entries (~4-6 bytes/posting, ~10x reduction). Add a bm25_score pseudo-predicate for variable-based score ordering, following the similar_to pattern.

- Add posting/bm25enc package for compact binary encode/decode
- Rewrite write path in posting/index.go for direct Badger KV
- Add bm25Writes buffer to LocalCache with read-your-own-writes
- Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta
- Rewrite read path in worker/task.go with direct blob decoding
- Add bm25_score pseudo-predicate in query/query.go
- Add score-ordering integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases

Cover incremental add/update/delete, IDF score stability as the corpus grows, large-corpus pagination, unicode, stopwords, uid filtering, score validation, and concurrent batch adds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c tests

Addresses test-coverage gaps identified during code review against ArangoDB's BM25 implementation:

- TestBM25ExactScoreValues: validates numerical correctness of the BM25 formula, using b=0 to enable hand-computed expected scores
- TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization and contrasts with the default b=0.75 behavior
- TestBM25SingleMatchingDocument: covers the df=1 edge case with high IDF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of the BM25 scaling plan. Introduces the bm25block package with:

- BlockMeta/Dir types for block-directory encoding/decoding
- SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks
- MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal
- ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation
- New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey, BM25DocLenBlockKey for block-addressed Badger KV storage

17 unit tests and benchmarks for the block storage format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
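The SplitIntoBlocks idea can be sketched in a few lines (an illustrative sketch, not the bm25block API; the function name and signature here are assumptions):

```go
package main

import "fmt"

// splitIntoBlocks cuts a monolithic entry slice into fixed-size blocks so
// reads, merges, and upper-bound precomputation can address one block at a
// time instead of decoding the whole list.
func splitIntoBlocks(entries []uint64, blockSize int) [][]uint64 {
	var blocks [][]uint64
	for len(entries) > 0 {
		n := blockSize
		if n > len(entries) {
			n = len(entries) // final, possibly short, block
		}
		blocks = append(blocks, entries[:n])
		entries = entries[n:]
	}
	return blocks
}

func main() {
	entries := make([]uint64, 300)
	for i := range entries {
		entries[i] = uint64(i)
	}
	blocks := splitIntoBlocks(entries, 128)
	fmt.Println(len(blocks), len(blocks[0]), len(blocks[2])) // 3 128 44
}
```

Each block then gets its own KV key plus a directory entry, which is what the BM25TermDirKey/BM25TermBlockKey split above provides.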
Phases 2-4 of the BM25 scaling plan.

Phase 2 - Segmented mutation path:
- addBM25IndexMutations now writes to block-based storage
- Each term's postings split into ~128-entry blocks with a directory
- Blocks automatically split when exceeding 256 entries
- Doc-length list also uses block-based storage
- Block removal and directory cleanup on deletes

Phase 3 - WAND top-k query path:
- New bm25wand.go with listIter for block-based posting-list iteration
- WAND algorithm with min-heap for top-k early termination
- Per-block upper bounds (UBPre) computed from maxTF at query time
- Suffix-max UBPre for efficient threshold checking
- Falls back to scoring all docs when no first: limit or offset is used

Phase 4 - Block-Max WAND:
- skipToWithBMW skips entire blocks whose UB + other terms can't beat theta
- Avoids Badger reads for blocks that can't contribute to top-k
- Enabled by default in handleBM25Search

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
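The min-heap at the heart of the Phase 3 top-k path can be sketched with container/heap (a minimal sketch under assumed names, not the PR's bm25wand code): the heap holds the best k hits seen so far, its root is the threshold theta, and candidates whose upper bound cannot beat theta are skipped without scoring.

```go
package main

import (
	"container/heap"
	"fmt"
)

type hit struct {
	uid   uint64
	score float64
}

// topKHeap is a min-heap by score: the root is always the weakest of the
// current top-k, i.e. the WAND threshold theta.
type topKHeap []hit

func (h topKHeap) Len() int           { return len(h) }
func (h topKHeap) Less(i, j int) bool { return h[i].score < h[j].score }
func (h topKHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *topKHeap) Push(x any)        { *h = append(*h, x.(hit)) }
func (h *topKHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// offer inserts a candidate, evicting the weakest hit once the heap is
// full, and returns the new threshold theta (0 until k hits are held).
func offer(h *topKHeap, k int, c hit) float64 {
	if h.Len() < k {
		heap.Push(h, c)
	} else if c.score > (*h)[0].score {
		(*h)[0] = c
		heap.Fix(h, 0)
	}
	if h.Len() < k {
		return 0
	}
	return (*h)[0].score
}

func main() {
	var h topKHeap
	var theta float64
	for _, c := range []hit{{1, 2.0}, {2, 5.0}, {3, 1.0}, {4, 4.0}} {
		theta = offer(&h, 2, c)
	}
	fmt.Println(theta) // 4: top-2 are scores 5.0 and 4.0
}
```

Block-Max WAND then extends this by comparing theta against per-block upper bounds, so entire blocks (and their Badger reads) are skipped, not just individual documents.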
Phase 5 - Migration support:
- newListIter falls back to the legacy monolithic blob when no block directory exists
- lookupDocLen falls back to the legacy BM25DocLenKey blob
- wandSearch falls back to the legacy BM25IndexKey for df computation
- Legacy data transparently served through a synthetic single-block directory
- New writes always use the block format; old data works until overwritten

Unit tests for WAND components:
- TestTopKHeapBasic: heap operations, threshold, eviction
- TestTopKHeapTieBreaking: deterministic ordering on score ties
- TestBm25ScoreFunction: formula verification, tf/dl/b edge cases
- TestBm25ScoreNaN: no NaN/Inf for edge-case inputs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes critical bugs and performance issues identified by GPT-5 review:

- Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against inBlockPos < 0 (possible before the first next() call)
- Fix empty-block pathological behavior: next()/skipTo()/skipToWithBMW() now skip empty blocks instead of leaving the iterator in an invalid state with a MaxUint64 pivotDoc
- Fix legacy loadBlock: no longer resets inBlockPos to 0 (this moved the pointer backwards and could cause re-scoring or infinite loops)
- Fix remainingUB panic: guard against blockIdx < 0 (before the first next())
- Add docLenCache: caches doclen directory + block reads within a single query, avoiding repeated Badger reads per scored document
- Optimize BMW otherUB: compute it as sumUB - thisUB (O(1)) instead of iterating all other terms (O(q^2) -> O(q))

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
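The docLenCache item amounts to per-query memoization of block reads. A hypothetical sketch (field and method names are illustrative, not the PR's code): the first lookup into a doclen block pays the backing-store read, and every later lookup within the same query is served from memory.

```go
package main

import "fmt"

// docLenCache memoizes decoded doc-length blocks for the lifetime of one
// query, so scoring many documents in the same block costs one store read.
type docLenCache struct {
	blocks map[uint64][]uint32           // blockID -> decoded doc lengths
	read   func(blockID uint64) []uint32 // backing read (e.g. Badger)
	misses int                           // number of backing reads performed
}

func (c *docLenCache) lookup(blockID uint64, pos int) uint32 {
	blk, ok := c.blocks[blockID]
	if !ok {
		blk = c.read(blockID)
		c.blocks[blockID] = blk
		c.misses++
	}
	return blk[pos]
}

func main() {
	c := &docLenCache{
		blocks: map[uint64][]uint32{},
		read:   func(uint64) []uint32 { return []uint32{10, 12, 7} },
	}
	// Three lookups into the same block cost a single backing read.
	_ = c.lookup(0, 0)
	_ = c.lookup(0, 1)
	_ = c.lookup(0, 2)
	fmt.Println(c.misses) // 1
}
```

The otherUB optimization in the last bullet is simpler still: once sumUB over all iterators is maintained, the bound excluding one term is just sumUB - thisUB, turning a per-pivot loop into one subtraction.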
…UB underestimate

Three fixes:

1. CRITICAL: addBM25IndexMutations now checks whether a UID already exists in doclen blocks before incrementing stats, preventing double-counting on SET when the document was already indexed (a defensive guard for batch mutations).
2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to the pivot), so BMW's otherUB calculation is correct and won't skip valid candidate blocks.
3. PERF: newListIter accepts a pre-read Dir to eliminate duplicate Badger reads (the directory was read once for df, then again inside newListIter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ength

Defensive hardening from GPT-5 review: if inBlockPos exceeds the block length after next() reaches the end of a block, the sort.Search span could go negative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DecodeCount() to bm25enc for O(1) entry-count reads without a full decode, preventing OOM on legacy migration with large posting lists (e.g., common terms with millions of entries)
- Use DecodeCount in the WAND search legacy DF-calculation path
- Fix integer overflow in the DecodeDir bounds check by using uint64 arithmetic (prevents a panic on corrupted data with a MaxUint32 count)
- Pre-allocate a shared score buffer in handleBM25Search with three-index slices to prevent accidental append corruption
- Document the bm25Writes concurrency model and limitations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shaun, thank you for this. The depth of work here is obvious: WAND/Block-Max, length normalization, the test coverage, the careful concurrency notes. I'm going to decline it, and I want to be transparent about why so the time isn't lost. The decision isn't really about BM25 as a feature. It's that landing it as proposed would mean carrying a parallel storage and retrieval stack inside Dgraph, alongside the one we already have:
The note in lists.go is telling: "two transactions updating different UIDs that share a common term could theoretically race… If higher write concurrency is needed, blocks should be integrated into the posting list delta mechanism." That's a real concurrency regression relative to every other tokenizer, mitigated only by leaning on Raft serialization. And the cleanest fix is exactly the integration this PR chose not to do. Every other tokenizer (term, fulltext, trigram, ngram, geo, hash) lands as a Tokenizer impl in tok/ plus a FuncType case in worker/task.go. The reason this PR is +3471 / −4 is that it isn't really shaped like a tokenizer, it's a retrieval engine sitting next to the existing one. Maintaining two indexing pipelines, two write paths, and two concurrency stories is the cost we'd be signing up for in perpetuity. If you're up for it, the version we'd be down to review would:
That's a much smaller PR — and it inherits all of Dgraph's existing concurrency, snapshot, backup, and rollup behavior for free. Happy to sketch the integration points if useful. Either way, thank you again — this is clearly a lot of careful work, and the analysis you did (especially around the IDF variant, length normalization, and the block-max upper bounds) is exactly the kind of rigor we want on a feature like this.
Summary
- @index(bm25) schema directive and bm25(predicate, "query") DQL syntax
- Optional k/b parameters: bm25(pred, "query", "1.5", "0.5")
- Usable as a root function or inside @filter

Changes
- tok/tok.go
- tok/tokens.go
- x/keys.go
- posting/index.go
- worker/task.go
- worker/tokens.go
- dql/parser.go
- tok/tok_test.go
- query/query_bm25_test.go
- query/common_test.go

BM25 Formula
score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf)
IDF = log1p((N - df + 0.5) / (df + 0.5))

Default: k=1.2, b=0.75 (Lucene/Elasticsearch variant with non-negative IDF)
Test plan
- go test ./tok/... -run TestBM25 -v (12 tests)
- go vet clean on all modified packages
- go test -tags integration -run TestBM25 ./query/ -v (requires Docker cluster)

🤖 Generated with Claude Code